# Multimodal Fusion
## VideoLLaMA2.1-7B-AV-CoT
**License:** Apache-2.0 · **Tags:** Video-to-Text, Transformers, English · **Author:** lym0302 · **Downloads:** 34 · **Likes:** 0

VideoLLaMA2.1-7B-AV is a multimodal large language model for audio-visual question answering. It processes video and audio inputs jointly to produce high-quality answers and descriptions.

## HunyuanVideo-I2V
**License:** Other · **Author:** tencent · **Downloads:** 3,272 · **Likes:** 305

HunyuanVideo-I2V is an image-to-video generation framework extended from Tencent's HunyuanVideo model, supporting high-quality video generation from static images.

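For orientation, here is a minimal image-to-video sketch via diffusers. The pipeline class `HunyuanVideoImageToVideoPipeline` and the repo id `hunyuanvideo-community/HunyuanVideo-I2V` are assumptions based on the community diffusers port; consult the model card for the exact id and recommended dtype/offload settings.

```python
# Sketch: image-to-video with HunyuanVideo-I2V via diffusers.
# Assumption: the community port "hunyuanvideo-community/HunyuanVideo-I2V";
# verify the repo id and recommended dtype on the model card.
import torch
from diffusers import HunyuanVideoImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trades speed for much lower VRAM use

image = load_image("input.png")
frames = pipe(
    image=image,
    prompt="a serene landscape, the camera slowly pans right",
    num_frames=49,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```
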
## ViT-BART Image Captioner
**License:** Apache-2.0 · **Tags:** Image-to-Text, English · **Author:** SrujanTopalle · **Downloads:** 15 · **Likes:** 1

A vision-language model that pairs a ViT image encoder with a BART-Large decoder to generate English descriptions of images.

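Pairing a ViT encoder with a BART decoder is the standard `VisionEncoderDecoderModel` pattern in transformers, so a captioning call would plausibly look like the sketch below; the repo id is hypothetical and should be replaced with the actual checkpoint.

```python
# Sketch of the ViT-encoder + BART-decoder captioning pattern.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "SrujanTopalle/vit-bart-image-captioner"  # hypothetical id
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# ViT encodes the image; BART decodes the caption autoregressively.
output_ids = model.generate(pixel_values, max_new_tokens=30, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
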
## SD3.5-Large IP-Adapter
**License:** Other · **Tags:** Text-to-Image, English · **Author:** InstantX · **Downloads:** 1,474 · **Likes:** 106

An IP-Adapter for the SD3.5-Large model that conditions generation on a reference image alongside the text prompt to produce new images.

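Recent diffusers releases add IP-Adapter loading to `StableDiffusion3Pipeline`; the sketch below assumes that support plus the hub ids shown, so treat it as a starting point and defer to the adapter's model card for the blessed loading code.

```python
# Sketch: SD3.5-Large with an IP-Adapter reference image.
# Assumptions: diffusers with SD3 IP-Adapter support; hub ids as shown.
import torch
from diffusers import StableDiffusion3Pipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_ip_adapter("InstantX/SD3.5-Large-IP-Adapter")
pipe.set_ip_adapter_scale(0.6)  # 0 = ignore the image, 1 = follow it closely

reference = load_image("reference.jpg")
image = pipe(
    prompt="a cat wearing a spacesuit, studio lighting",
    ip_adapter_image=reference,
).images[0]
image.save("out.png")
```
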
## SDXL IP-Adapter
**License:** Apache-2.0 · **Tags:** Text-to-Image, Other · **Author:** refiners · **Downloads:** 18 · **Likes:** 0

IP-Adapter is an image prompt adapter for text-to-image diffusion models that combines image prompts with text prompts to improve the relevance and quality of generated images.

## AA-Chameleon-7B-Base
**Tags:** Text-to-Image, Transformers, English · **Author:** PKU-Alignment · **Downloads:** 105 · **Likes:** 8

A multimodal model supporting interleaved text-image input and output, based on the Chameleon 7B model with image generation capabilities enhanced through the Align-Anything framework.

## LinFusion XL
**Tags:** Text-to-Image · **Author:** Yuanshi · **Downloads:** 37 · **Likes:** 7

LinFusion is a diffusion-based text-to-image generation model that produces high-quality images from textual descriptions.

## AV-HuBERT
**Tags:** Audio-to-Text, Transformers · **Author:** nguyenvulebinh · **Downloads:** 683 · **Likes:** 3

A multilingual audio-visual speech recognition model trained on the MuAViC dataset, combining audio and visual modalities for robust recognition.

## ChatTime-1-7B-Base
**License:** Apache-2.0 · **Tags:** Multimodal Fusion, Transformers · **Author:** ChengsenWang · **Downloads:** 700 · **Likes:** 4

ChatTime is a multimodal time-series foundation model that treats time series as a foreign language, unifying bimodal input and output over both time series and text.

## ConsistentID
**License:** MIT · **Tags:** Text-to-Image, Other · **Author:** JackAILab · **Downloads:** 176 · **Likes:** 8

ConsistentID is a multimodal fine-grained identity-preserving portrait generation model that produces portraits with very high identity fidelity while maintaining diversity and text controllability.

## Music Generation Model
**License:** Apache-2.0 · **Tags:** Text-to-Audio, Transformers · **Author:** nagayama0706 · **Downloads:** 27 · **Likes:** 1

A hybrid model created by merging a text generation model with a music generation model, capable of handling both text and music generation tasks.

## InstructBLIP Flan-T5-XXL 8-bit
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Mediocreatmybest · **Downloads:** 18 · **Likes:** 1

An 8-bit quantized build of InstructBLIP with a Flan-T5-XXL language model: a vision-language model pretrained with the image encoder and large language model frozen, supporting image caption generation and visual question answering.

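InstructBLIP has first-class transformers support (`InstructBlipProcessor`, `InstructBlipForConditionalGeneration`), and 8-bit loading goes through bitsandbytes. A minimal VQA sketch follows; the repo id of this particular 8-bit build is an assumption.

```python
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)

model_id = "Mediocreatmybest/instructblip-flan-t5-xxl_8bit"  # assumed hub id
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # needs bitsandbytes
    device_map="auto",
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(
    images=image, text="What is happening in this image?", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
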
## YOLO LLaMa 7B VisNav
**License:** Other · **Tags:** Multimodal Fusion, Transformers · **Author:** LearnItAnyway · **Downloads:** 19 · **Likes:** 1

Integrates the YOLO object detection model with the Llama 2 7B large language model to provide navigation assistance for visually impaired users in daily travel.

## TimeSformer-BERT Video Captioning
**Tags:** Video-to-Text, Transformers · **Author:** AlexZigma · **Downloads:** 83 · **Likes:** 3

A video caption generation model based on the TimeSformer and BERT architectures, capable of generating descriptive captions for video content.

## BLIP-2 Flan-T5-XXL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** LanguageMachines · **Downloads:** 22 · **Likes:** 1

BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text tasks.

## FuseCap Image Captioning
**License:** MIT · **Tags:** Image-to-Text, Transformers · **Author:** noamrot · **Downloads:** 2,771 · **Likes:** 22

FuseCap is a framework for generating semantically rich image captions, leveraging large language models to produce fused image descriptions.

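FuseCap's released captioner appears to follow the plain BLIP interface; both the repo id and the BLIP-style classes in the sketch below are assumptions to verify against the model card.

```python
# Sketch: BLIP-style captioning, assuming FuseCap ships as a BLIP checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "noamrot/FuseCap_Image_Captioning"  # assumed hub id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
# A short text prefix is a common BLIP conditioning trick; "a picture of"
# here is an assumption, not a documented trigger.
inputs = processor(images=image, text="a picture of", return_tensors="pt")
out = model.generate(**inputs, num_beams=3, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```
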
## Raos Virtual Try-On Model
**License:** OpenRAIL · **Tags:** Image Generation · **Author:** gouthaml · **Downloads:** 258 · **Likes:** 41

A virtual try-on system built on the Stable Diffusion framework, integrating DreamBooth training, EfficientNet-B3 feature extraction, and OpenPose pose detection.

## BBS-Net
**License:** MIT · **Tags:** Image Segmentation, Transformers · **Author:** RGBD-SOD · **Downloads:** 21 · **Likes:** 3

BBS-Net is a deep learning model for RGB-D salient object detection, employing a bifurcated backbone strategy network that effectively processes RGB and depth image data.

## BLIP-2 Flan-T5-XXL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 6,419 · **Likes:** 88

BLIP-2 is a vision-language model that combines an image encoder with the Flan-T5-XXL large language model for image-to-text tasks.

## BLIP-2 OPT-2.7B COCO
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 3,900 · **Likes:** 9

BLIP-2 is a vision-language pretrained model that bootstraps language-image pretraining from a frozen image encoder and a frozen large language model.

## BLIP-2 OPT-6.7B
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 5,871 · **Likes:** 76

BLIP-2 is a vision-language model based on OPT-6.7B, pretrained with the image encoder and large language model frozen, supporting tasks such as image-to-text generation and visual question answering.

## BLIP-2 Flan-T5-XL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 91.77k · **Likes:** 68

BLIP-2 is a vision-language model based on Flan-T5-XL, pretrained with the image encoder and large language model frozen, supporting tasks such as image captioning and visual question answering.

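The four Salesforce BLIP-2 checkpoints above share one transformers interface, so a single sketch covers both captioning and VQA for all of them (swap the `model_id`):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Any of the listed checkpoints works here, e.g. "Salesforce/blip2-flan-t5-xxl",
# "Salesforce/blip2-opt-2.7b-coco", "Salesforce/blip2-opt-6.7b".
model_id = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))

# VQA: prepend a question in the "Question: ... Answer:" format.
inputs = processor(
    images=image,
    text="Question: how many people are there? Answer:",
    return_tensors="pt",
).to(model.device, torch.float16)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```
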
## Wavyfusion
**License:** OpenRAIL · **Tags:** Image Generation, English · **Author:** wavymulder · **Downloads:** 454 · **Likes:** 170

A Stable Diffusion-based text-to-image model for creative image generation.

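As a regular Stable Diffusion checkpoint, Wavyfusion should load with the stock diffusers pipeline; the repo id and the style trigger token in the prompt below are assumptions to check against the model card.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed hub id; verify on the model card.
pipe = StableDiffusionPipeline.from_pretrained(
    "wavymulder/wavyfusion", torch_dtype=torch.float16
).to("cuda")

# Style trigger token assumed to be "wa-vy style"; verify on the model card.
image = pipe(
    "wa-vy style, portrait of a fox in a misty forest",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("wavyfusion.png")
```
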
## Wav2Vec2-2-BART-Large
**Tags:** Speech Recognition, Transformers · **Author:** patrickvonplaten · **Downloads:** 31 · **Likes:** 5

An automatic speech recognition (ASR) model that couples a wav2vec2-large-lv60 encoder with a bart-large decoder, fine-tuned on the clean subset of the librispeech_asr dataset.

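This is the warm-started `SpeechEncoderDecoderModel` pattern (wav2vec2 encoder, BART decoder). The sketch assumes the hub id `patrickvonplaten/wav2vec2-2-bart-large` and the usual feature-extractor/tokenizer pairing:

```python
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel

model_id = "patrickvonplaten/wav2vec2-2-bart-large"  # assumed hub id
model = SpeechEncoderDecoderModel.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The wav2vec2 encoder expects 16 kHz mono audio.
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
)
generated_ids = model.generate(inputs.input_values)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
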